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Psoriasis is a common chronic inflammatory disease of the skin. We sought to use bacterial community 
abundance data to assess the feasibility of developing multivariate molecular signatures for differentiation 
of cutaneous psoriatic lesions, clinically unaffected contralateral skin from psoriatic patients, and similar 
cutaneous loci in matched healthy control subjects. Using 16S rRNA high-throughput DNA sequencing, we 
assayed the cutaneous microbiome for 51 such matched specimen triplets including subjects of both 
genders, different age groups, ethnicities and multiple body sites. None of the subjects had recently received 
relevant treatments or antibiotics. We found that molecular signatures for the diagnosis of psoriasis result in 
significant accuracy ranging from 0.75 to 0.89 AUC, depending on the classification task. We also found a 
significant effect of DNA sequencing and downstream analysis protocols on the accuracy of molecular 
signatures. Our results demonstrate that it is feasible to develop accurate molecular signatures for the 
diagnosis of psoriasis from microbiomic data. 



Psoriasis is a chronic inflammatory disease of unknown cause. Few studies of the microbiota in psoriatic 
patients have used molecular methods for the detection of bacterial and fungal taxa 1 \ Such studies have 
involved relatively small numbers of subjects, and relatively low depths of coverage. Similarly, no studies 
have developed multivariate molecular signatures of psoriasis from high-throughput molecular data from the 
skin and assessed their accuracy. By "multivariate molecular signature" or "molecular signature" of psoriasis, we 
mean a multivariate computational or mathematical model that can either classify or diagnose psoriasis from 
molecular data. Studies that have generated high-throughput molecular data from the psoriatic skin lesions focus 
primarily on differential gene expression 4 6 . Outside of the domain of psoriasis, there have been few efforts to 
develop and assess multivariate molecular signatures using microbiomic data for clinical tasks. Most related 
efforts involved classification of body sites or subject/population identification 7 9 . 

As part of the Human Microbiome Project of the National Institutes of Health, we sought to assess feasibility 
and compare methodologies for developing molecular signatures of psoriasis using microbiomic data from the 
skin. Because the cutaneous microbiota is complex 10,11 , and its composition is site-specific 12,13 , we matched 
lesional skin with unaffected contralateral skin from the same subject. Likewise, the samples from demograph- 
ically matched healthy controls were collected at standardized sites where psoriasis lesions commonly appear. 
The experimental specimen matching and sample collection protocols were designed to measure predictive 
microbial markers that are common across body-sites regardless of the inter-site variability. Each site may have 
additional specific markers which may not be detected by the current design. We used this set of specimens to 
obtain high-throughput sequencing of the 16S rRNA gene for both the V1-V3 and the V3-V5 loci 14 . Using 
resulting microbiomic data, we then developed and evaluated multivariate molecular signatures of psoriasis 
aimed at the separation of cutaneous psoriatic lesions, the clinically uninvolved contralateral skin, and the similar 
skin loci in matched healthy control subjects. The availability of two different datasets, based on the V1-V3 and 
the V3-V5 16S rRNA gene loci, enabled us to study the effect of the DNA sequencing protocol on the classifica- 
tion accuracy of molecular signatures. Similarly, we experimented with 37 protocols for downstream analysis by 
using various methods for selecting features/taxa (taxonomic features) for inclusion in the molecular signatures. 
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This approach allowed us to study the effects of downstream analysis 
protocols on the classification accuracy of molecular signatures. 

Results 

Microbiomic signatures can diagnose psoriasis with high accu- 
racy. Using the high-throughput 16S rRNA gene survey data from 
the V3-V5 locus, we developed and evaluated molecular signatures 
of psoriasis for four binary classification tasks aimed at separating 
psoriasis lesions (PL: psoriatic plaque), psoriasis normal (PN: 
contralateral area of clinically uninvolved skin), and healthy 
control (CC) samples: (i) PN vs. CC, (ii) PL vs. CC, (iii) PL vs. PN, 
and (iv) CC vs. PL and PN. 

The resulting classification accuracy (AUC) and the number of 
taxa participating in molecular signatures derived from the data from 
the V3-V5 locus are shown in Table 1, Panel A. Using very few taxa 
(2-4) it is possible to separate classes of samples with high and 
statistically significant classification accuracy. Of the four classifica- 
tion tasks, separation of CC vs. PL and PN yields the highest clas- 
sification accuracy (AUC = 0.894), while separation of PN vs. PL has 
lower (AUC = 0.754) but still statistically significant classification 
accuracy (Table 1, Panel A, Columns 1 and 2). The number of taxa 
in the constructed molecular signatures is provided in Table 1, Panel 
A, Columns 3 and 4. Detailed information on (a) the features/taxa 
selected by GLL in all data as well as (b) the 20 most frequently 
selected features/taxa by GLL in the training sets during cross-valid- 
ation is provided in Supplementary File 1 . The file reports individual 
AUC's of the features as well as p-values from Fisher's Z-test; both 
AUC and p-values were computed on all data samples. In addition, 
the file reports frequency of selection for the 20 most frequently 
selected features/taxa (the maximum frequency is 500 because this 
is the number of training sets in which GLL was applied). 
Supplementary File 3 provides a list of the 50 most univariately 
discriminative taxonomic features for each classification task along 
with their individual AUC's and mean relative abundances per 
group. Features in this list that have also been selected by GLL in 
all data are highlighted with yellow. 



DNA sequencing protocol affects the accuracy of microbiomic 
signatures. We also investigated the effect of using a survey of 
microbiota from a different region (V1-V3) of the 16S rRNA gene 
on classification accuracy of molecular signatures by repeating the 
same experiment described above. We focused on the same four 
binary classification tasks (PN vs. CC, PL vs. CC, PL vs. PN, and 
CC vs. PL and PN) and followed a nested cross-validation design to 
select features/taxa and build molecular signatures. 

The resulting classification accuracy (AUC) and the number of 
taxa participating in the molecular signatures are shown in Table 1, 
Panel B. Detailed information on features/taxa selected by GLL is 
provided in Supplementary File 2. Information about the 50 most 
univariately discriminative taxonomic features for each classification 
task is provided in Supplementary File 4. Only the separation of PL 
from CC has statistically significant accuracy. The observation that 
other classification tasks do not yield statistically significant accura- 
cies indicates that the V1-V3 locus, unlike V3-V5, captures less 
information about the different skin types. Although the V1-V3 
based survey can still be useful for diagnostic purposes, the ability 
to distinguish between the unaffected skin types (normal vs. control) 
or the skin within the affected individuals (normal vs. lesion) has 
been lost compared to V3-V5. The general reason for differences in 
performance is because amplification of V1-V3 and V3-V5 regions 
provides different surveys of the variability of 16S rRNA gene in the 
communities. 

Downstream analysis protocol affects the accuracy of micro- 
biomic signatures. Table SI in Supplementary Information 

describes all downstream analysis protocols evaluated using the 
16S rRNA data both from the V1-V3 and the V3-V5 loci for all 
four binary classification tasks (PN vs. CC, PL vs. CC, PL vs. PN, and 
CC vs. PL and PN). Each protocol differs from the others by using a 
different feature/taxa selection method. 

The resulting classification accuracy (AUC) and the number of 
features/taxa participating in molecular signatures is shown for 
results averaged over all four classification tasks (Figure 1) and for 
individual results for each classification task (Figure 2) for data from 



Table 1 | Classification accuracy (AUC) and the number of taxa in the molecular signatures of psoriasis constructed by the GLL and SVM 
algorithms using the V3-V5 1 6S rRNA locus data (Panel A) and using the VI -V3 1 6S rRNA locus data (Panel B)* 



Panel A: V3-V5 1 6S rRNA locus 



1 



Classification accuracy (AUC) Number of selected taxa 

Classification task Cross-validation estimate Statistical significance (p-value) From cross-validation (mean) From the entire dataset 

PNvs.CC 0.854 <0.001 2.8 2 

PL vs. CC 0.806 0.002 2.5 2 

PLvs.PN 0.754 0.004 2.1 3 

CCvs.PLandPN 0.894 <0.001 3.7 4 

Panel B: VI -V3 16S rRNA locus 



Classification accuracy (AUC) Number of selected taxa 



Classification task Cross-validation estimate Statistical significance (p-value) From cross-validation (mean) From the entire dataset 

PNvs.CC 0.405 0.985 2 1 

PL vs. CC 0.751 <0.001 3.8 4 

PLvs.PN 0.576 0.080 3.1 3 

CCvs.PLandPN 0.482 0.618 4.2 3 

'Results for molecular signatures with statistically significant classification accuracy (at 5% alpha-level adjusted for multiple comparisons) are shown with bold font. Column 4 reports the number of selected 
taxa when the GLL method was applied to the entire dataset (all samples). Column 3 reports the number of taxa selected on average in all training sets of five-fold cross-validation (repeated 1 00 times with 
different splits into five folds). These represent results of the application of GLL to (5 X 100 = 500) training sets of the dataset with 80% of samples. 
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Figure 1 | Classification accuracy (AUC) versus number of selected features/taxa for 37 feature selection methods averaged over 4 classification tasks 
(PN vs. CC, PL vs. CC, PL vs. PN, and CC vs. PL and PN) in data based on the V3-V5 16S rRNA locus. Methods from the same algorithmic family are 
shown with the same markers in the figure. The pink area contains methods that have nominally higher classification accuracy (AUC) than GLL. The 
green area contains methods that have selected fewer taxa than GLL. The red dash-dotted line indicates a Pareto frontier constructed over non-GLL 
methods. Methods on the Pareto frontier are such that no other non-GLL method has both higher AUC and a smaller number of selected features 
averaged over the four classification tasks. 



the V3-V5 locus. Detailed results and statistical comparison are 
given in Table S2 (for classification accuracy) and Table S3 (for 
number of selected features/taxa) in Supplementary Information. 
With respect to classification accuracy, there are 8 methods (includ- 
ing GLL) that yield molecular signatures with statistically significant 
classification accuracies for all 4 tasks (GLL, UAF-KW1, UAF-SN1, 
UAF-T1, UAF-T2, RFVS1, RFVS2, LARS-EN1). The GLL feature/ 
taxa selection method outperforms 34 of the 36 comparator methods 
on average over four classification tasks. While there are 2 methods 
(RFVS1 and RFVS2) that yield slightly higher classification accuracy 
than GLL, the differences in accuracy are not statistically significant. 
With respect to the number of selected features/taxa, GLL selects 
fewer features/taxa than 24 of the 36 comparator methods (including 
RFVS1 and RFVS2) on average over the four classification tasks. 
There are 12 methods that select fewer features than GLL (6 of these 
12 methods select significantly fewer features), however there is no 
single method that both selects fewer features and yields higher clas- 
sification accuracy than GLL. Combined with the previous obser- 
vation that there is no method that yields significantly higher 
classification accuracy than GLL regardless of the number of 



features, we conclude that GLL is the optimal method for the 
development of molecular signatures from this data. 

Detailed results and statistical comparison for data using the VI- 
V3 locus are given in Table S4 (for classification accuracy) and Table 
S5 (for number of selected features/taxa) in Supplementary 
Information. None of the other comparator methods yield molecu- 
lar signatures with statistically significant classification accuracies for 
all 4 tasks. Therefore, we did not perform comparison between meth- 
ods using the data from the VI -V3 locus. 

Discussion 

The original analysis of the VI -V3 dataset (Alekseyenko AV, Perez- 
Perez GI, D'Souza A, Strober B, Gao Z, Bihan M, Li K, Methe BA, 
Blaser MJ: Population differentiation of the cutaneous microbiota in 
psoriasis, forthcoming) consisted of ecological analysis, which iden- 
tified increasing intra-group (Unifrac) beta-diversity in the psoriasis 
specimens (diversity CC < PN < PL). The PL and PN specimens also 
showed decreased taxonomic richness and distribution evenness 
(Shannon index) compared with the CC (Control samples). These 
differences could be explained by the combined increased abundance 
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Figure 2 | Classification accuracy (AUC) versus number of selected features/taxa for 37 feature selection methods for each of the four classification 
tasks (PN vs. CC, PL vs. CC, PL vs. PN, and CC vs. PL and PN) in data based on the V3- V5 16S rRNA gene locus. Methods from the same algorithmic 
family are shown with the same markers in the figure. The pink area contains methods that have nominally higher classification accuracy (AUC) 
than GLL. The green area contains methods that have selected fewer taxa than GLL. The red dash-dotted line indicates a Pareto frontier constructed over 
non-GLL methods. Methods on the Pareto frontier are such that no other non-GLL method has both higher AUC and a smaller number of selected 
features for each classification task. 



of the four major skin-associated genera (Corynebacterium, Propio- 
nibacterium, Staphylococcus, and Streptococcus). The univariate ana- 
lysis using the AUC determined that the three genera Cupriavidus, 
Methylobacterium, and Schlegelella statistically significantly (at 5% 
alpha-level) classify the groups of specimens. Interestingly, these 
genera also often participate in the multivariate molecular signatures 
of psoriasis constructed in the present study (detailed comparison is 
provided in Table S6 in Supplementary Information). Finally, based 
on partitioning around medoids clustering, the ecological study 
identified two distinct cutaneotypes: (1) Proteobacteria-assocmted, 
and (2) Firmicutes- and Actinobacteria-assocmted. Cutaneotype 2 



was enriched in PL specimens compared to CC (OR 3.52 [1.44- 
8.98], p < 0.01). 

A useful theoretical property of the GLL method is that under 
broad distributional assumptions, it provably discovers the local 
pathway membership around the response variable of interest 15 . 
Specifically, this means that under sufficient assumptions of the 
method 15 , which are commonly accepted and used in causal discov- 
ery research 16 , GLL will output from observational data the set of 
direct causes, direct effects, and (optionally, depending on the ver- 
sion of GLL) direct causes of direct effects of the response variable 
(in this study, the response variable is a binary indicator of the 
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phenotype). This theory has been tested in both biological and non- 
biological fields; see for example 15,17,18 . The output of non-causal 
feature selection methods, which are comparator methods used in 
this study, cannot in general be interpreted causally and often results 
in associated variables that do not have a causal relation with the 
phenotype; they can merely be passengers 15,17,19 . Nevertheless, we 
consider the current work to be an initial step in the discovery pro- 
cess. To causally validate and interpret the findings, several consid- 
erations must be addressed: the genetic composition and immune 
status of the host, prior exposure and entrance routes of potential 
pathogens, pathogenetic mechanisms of implicated taxa, and func- 
tional assessment of the impact of microbial community composi- 
tion on the observed phenotype. 

In conclusion, this work demonstrates that it is possible to develop 
accurate molecular signatures for the diagnosis of psoriasis from 
microbiomic data. We have shown that the accuracy of molecular 
signatures depends on the DNA sequencing and downstream ana- 
lysis protocols. Using the 16S rRNA data from the V3-V5 locus leads 
to accurate and statistically significant molecular signatures, whereas 
data from the V1-V3 locus carries limited diagnostic signal. With 
respect to the downstream analysis protocol, our results suggest that 
the Markov boundary-based Generalized Local Learning (GLL) 
method works well for this type of data, returning molecular signa- 
tures based on very few (2-4) features/taxa with statistically optimal 
classification accuracy. Furthermore, the features/taxa selected by 
GLL have putative causal interpretation. This warrants their further 
study and quantifying their effect on phenotype. The approach pre- 
sented in this study was designed as a general-purpose downstream 
analysis methodology and as such may facilitate the design and 
development of microbiomic molecular signatures of other diseases 
in addition to psoriasis. 

Methods 

Study population. We obtained informed written consent (using the model consent 
forms for the HMP demonstration projects) and enrolled a total of 199 subjects (75 
patients with psoriasis and 124 healthy controls) under New York University School 
of Medicine IRB #08-709 between June 2008 and September 2011. Psoriatic patients 
were diagnosed at an academic dermatology practice, and psoriasis was clinically 
classified based on characteristic morphologic features of the individual skin lesions 
and their cutaneous distribution. Among the patients with psoriasis, 57 (76%) had not 
been exposed to antibiotics or received a relevant clinical treatment for at least one 
month before skin samples were obtained, and of these, a total of 54 patients were 
studied. Among the healthy controls, 112 (90.3%) had not been exposed to antibiotics 
or received any relevant treatment for at least one month before skin samples were 
obtained. 

For the 54 subjects with psoriasis, we sought control subjects of the same gender 
and ethnicity, and of similar age (±5 years), for whom a cutaneous specimen was 
obtained in a region proximate to the psoriasis lesion. In total, we obtained matching 
control specimens from 37 (29.8%) subjects. One or more sites from each of these 
controls were matched to the 54 subjects with psoriasis. A control subject could be 
matched to more than one patient, since we also matched for cutaneous site. 
However, each control cutaneous site was uniquely mapped to only one triplet, thus 
there was no duplication of specimens in the analysis. 

Specimen collection and processing. In the patients with psoriasis, we sampled a 
typical psoriatic plaque (designated as Psoriasis, lesion), and as a reference site, the 
contralateral area of clinically uninvolved skin (designated as Psoriasis, normal). We 
also sampled skin from matched healthy (control) individuals at locations that 
approximated the location of the psoriatic lesion. We accomplished this by obtaining 
4 skin swabs from each control person, from scalp (posterior-temporal, above the ear 
crease), elbow (inner aspect), lower lateral abdomen, and knee. The selected sampling 
sites reflect common general locations on the head, torso, and extremities where 
psoriatic lesions arise in patients 2 . Both the lesions and the control specimens were 
predominantly obtained from dry and oily cutaneous sites 12 , reflecting the 
distribution of psoriatic lesions in general 1 . 

Methods for specimen processing are described elsewhere 20 . In brief, a 2 by 2-cm 
area of the cutaneous surface at each of the locations was sampled by swabbing the 
skin with a cotton pledget that had been soaked in sterile 0.15 M NaCl with 0.1% 
Tween 20 (Fisher Scientific, Fair Lawn NJ). DNA was extracted from the swab 
suspensions in a PCR-free clean room by using the DNeasy blood and tissue kit 
(Qiagen, Chatsworth CA); glass beads (0.5 to 1 mm) were added to the specimens and 
vortex-mixed at maximum speed for 40 s, followed by DNA extraction, using the 
manufacturer's protocol for genomic DNA isolation from Gram-positive bacteria, 
and samples were eluted in 100 ul AE buffer (DNeasy Blood and Tissue kit; Qiagen). 



To eliminate bacterial or DNA contamination, lysozyme ( Sigma- Aldrich, St. Louis 
MO) was passed through a microcentrifuge filter (molecular mass threshold, 
30,000 Da; Amicon, Bedford MA) at 18,514 X g for 20 min before adding to the 
enzymatic lysis buffer. 

DNA sequencing and upstream processing. Samples were prepared for 
amplification and sequencing at the J. Craig Venter Institute ( JCVI) Joint Technology 
Center (JTC) 14 . Genomic DNA sample concentrations were normalized. The V1-V3 
region of the 16S rRNA gene was amplified using forward primer attached to the 
Roche B adapter for 454-library construction and reverse primer attached to the 
Roche A adapter and a 10-nt barcode [5'-A-adapter-N (10) + 16S primer-3]. A 
barcoded primer design was completed using a set of algorithms developed at the 
JCVI. PCR reactions were completed as follows (per reaction): 2 ul of gDNA, 1 X final 
concentration of Accuprime PCR Buffer II (Invitrogen, Carlsbad CA, USA), 200 nM 
forward and reverse primers, 0.75 U of Accuprime TaqDNA polymerase high fidelity 
(Invitrogen), and nuclease-free water to bring the final volume to 20 ul. PCR cycling 
conditions consisted of an initial denaturation of 2 min at 95°C, 30 cycles of 20 s at 
95°C, 30 s at 50°C, and 5 min at 72°C. A negative control (water blank) reaction was 
examined after 35 cycles. Samples then were quantified, cleaned, and sequenced on 
the Roche 454-FLX (454 Life Sciences, Branford CT, USA) as described 21 , and a read 
processing pipeline consisting of a set of modular scripts designed at the JCVI were 
employed for deconvolution, trimming, and quality filtering, as described 
previously 21 . Finally, we included in our analysis only samples with more than 2,000 
reads (the median number of reads per sample was 8,621). The sequencing effort was 
statistically similar across clinical skin types. The above protocol was repeated to 
obtain sequencing data from the V3-V5 locus of the 16S rRNA gene. 

As a result, we obtained two 16S rRNA datasets (one for the V1-V3 locus and 
another one for the V3-V5 locus) with the number of evaluable samples for each class 
given in Table 2. The number of samples in the dataset for the V3-V5 locus was 
smaller than for the V1-V3 locus primarily because of the lower sequencing quality 
and yield at the former locus, however as shown in the Results section the number of 
samples was sufficient to obtain molecular signatures and evaluate their statistical 
significance. The passing sequences were analyzed using the RDP classifier executed 
at a bootstrap confidence cut-off of 80% 22 and were normalized. This criterion 
resulted in processed datasets with 791 and 660 taxonomic features (relative taxo- 
nomic abundance at phylum, class, order, family, and genus levels) for the VI -V3 and 
the V3-V5 loci, respectively. These datasets were used for development of molecular 
signatures and assessment of their classification accuracy with the methods described 
in the following subsection. 

Downstream analyses for development of microbiomic signatures. The features/ 
taxa for inclusion in molecular signatures were selected by several feature selection 
methods. We used Generalized Local Learning (GLL), a supervised multivariate 
method, as a key technique for feature selection in this study 15 ' 19 . Under broad 
distributional assumptions, the taxa selected by GLL are theoretically guaranteed to 
achieve both maximal classification accuracy and maximal parsimony for the dataset 
at hand 1519 . In addition, because the GLL method is based on Markov Boundary 
theory (a branch of formal causal graph induction), under broad distributional 
assumptions, it returns the local pathway members around the response variable of 
interest. GLL was executed with a default setup: Fisher's Z-test was used for assessing 
conditional independence at a 5% alpha level, and the max-k parameter, which 
represents the maximum size of the conditioning sets, was set to 3. In addition to GLL, 
we applied 36 comparator methods/ variants that have been successful in other types 
of genomic data, as well as recent applications to microbiomics 7 ' 23 . The reasons for 
such an extensive range of methods is to shed light on relative strengths and 
weaknesses of state-of-the-art methods for microbiomic molecular signature 
development, and to ensure that we obtain as highly predictive and as compact 
signatures as possible from our data. In brief, these methods include: support vector 
machine-based recursive feature elimination (SVM-RFE) 24 ; univariate methods 
based on various statistical tests (Kruskal-Wallis non-parametric one-way ANOVA 
test, signal-to-noise ratio, ratio of between-group to within-group sum of squares, 
two-sample t-test, and x 2 -test 23 ' 25 " 27 , denoted as UAF-KW, UAF-S2N, UAF-BW, 
UAF-T, and UAF-X2, respectively); minimum redundancy and maximum relevancy 
methods (MRMR) 28,29 ; random forest-based variable selection (RFVS) 30 ; least angle 
regression elastic net (LARS-EN) 31 ; soft independent modeling of class analogy 
(SIMCA) 32 ; principal component and sparse principal component analysis-based 
methods (PCA and SPCA) 33 ; threshold gradient descent regularization (TGDR) 34 ; 
and using all features/taxa in the dataset without feature selection (ALL). All feature/ 
taxa selection methods and details about their variants are listed in Table SI. 

Once the features/taxa were selected, molecular signatures were developed using 
the support vector machine (SVM) supervised classification algorithm 35 . We chose to 
use SVM classifiers because they have high utility in many genomic datasets 23 , and 
have outstanding empirical performance in microbiomic data (see 36 and chapter 6 
of 23 ). Furthermore, several theoretical considerations warrant application of SVMs to 
microbiomic data: SVMs perform well in data with limited sample size, are relatively 
insensitive to high dimensionality of the data, prevent overfitting by using regular- 
ization techniques, and can learn both simple and complex decision functions 35 ' 37,38 . 
SVMs were applied with a polynomial kernel; the degree of the kernel (d) and penalty 
parameter (C) were optimized by nested cross-validation over values d — (1,2,3) and 
C = (0.01, 0.1, 1, 10, 100) as detailed below. We used the libSVM version 3.12 
implementation of SVMs 39 . 
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Table 2 | Number of samples for each class in 1 6S rRNA gene data from the VI -V3 and the V3-V5 loci 

Psoriasis, lesion (PL) Psoriasis, normal (PN) Healthy controls (CC) 

VI -V3 locus 51 51 49 

V3-V5 locus 22 22 22 



For estimation of classification accuracy and optimization of classifier hyper- 
parameters, we used nested five-fold cross-validation repeated 100 times with dif- 
ferent splits of the data into five folds 40,41 . This protocol proceeded as follows. First, all 
samples in the data were randomly split into five non- overlapping and approximately 
equal parts (folds) such that each fold retained a similar proportion of samples from 
two classes. Next, four of five folds {80% of the data; "original training set") were used 
for feature/taxa selection and development of the molecular signature, and the 
remaining fold (20% of the data; "original testing set") was used for estimation of 
classification accuracy of the molecular signature. The molecular signature was 
developed and evaluated five times so that each of the five folds could play the role of 
the original testing set, and its complement could play the role of the original training 
set. Then the entire process was repeated 100 times with different splits of the data 
into five folds. The resulting classification accuracy estimate was the mean over five 
different testing sets from each of 100 different data splits into folds. Since the 
methodologies for feature/taxa selection and development of molecular signature 
have parameters that require tuning, each of the original training sets (consisting of 
four folds) was further split into a smaller training set with three folds and a validation 
set with one fold, and a four-fold cross-validation protocol was used to estimate 
accuracy of each parameter configuration and select the one with the best average 
accuracy for application to the original training set. This protocol avoids overfitting 
and yields unbiased estimates since the testing set is not used for development of the 
molecular signature (i.e., no "information leak" occurs from the response variable 
back to the model selection, feature selection, and classifier fitting procedures). 

As a metric of classification accuracy, we used the area under the ROC curve 
(AUC) 42,43 . This metric has greater statistical power to detect signal than the com- 
monly used proportion of misclassifications and is insensitive to the ratio of cases to 
controls 44 . The ROC curve is the plot of sensitivity versus one minus the specificity for 
a range of classification threshold values. AUC ranges from 0 to 1, with an AUC equal 
to 0 indicating the classifier opposite to the perfect, 0.5 indicating a random (i.e., 
uninformative) classifier, —0.70-0. 75 indicating a mediocre classifier, —0.80-0.85 
indicating a very good classifier, —0.90 or better indicating an excellent classifier, and 
1 indicating perfect classification 42,43 . 

Statistical comparison/testing of results was conducted using two types of per- 
mutation tests following the theory developed by Good 45 . The first test assessed 
whether a molecular signature had statistically significant classification accuracy and 
tested the null hypothesis of AUC — 0.5. This test was implemented with 1,000 
permutations of the response variable, as described 46 . The second test assessed 
whether a molecular signature obtained with the GLL methodology was significantly 
better or worse than a molecular signature obtained with a different feature/taxa 
selection method. This test was implemented with 10,000 permutations, as 
described 47 . Since both statistical tests assessed a large number of comparisons, we 
performed adjustment for multiple comparisons using methods to control false- 
discovery 48 ' 49 . We used a 5% alpha-level adjusted for multiple comparisons to 
establish statistical significance. 

All experiments presented in this manuscript were run on the Asclepius Compute 
Cluster at the Center for Health Informatics and Bioinformatics (CHIBI) at New York 
University Langone Medical Center (http://www.nyuinformatics.org). 
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